The application of machine learning to predict high-cost patients: A performance-comparison of different models using healthcare claims data

Our aim was to predict future high-cost patients with machine learning using healthcare claims data. We applied a random forest (RF), a gradient boosting machine (GBM), an artificial neural network (ANN) and a logistic regression (LR) to predict high-cost patients in the following year. Therefore, we exploited routinely collected sickness funds claims and cost data of the years 2016, 2017 and 2018. Various specifications of each algorithm were trained and cross-validated on training data (n = 20,984) with claims and cost data from 2016 and outcomes from 2017. The best performing specifications of each algorithm were selected based on validation dataset performance. For performance comparison, selected models were applied to unforeseen data with features of the year 2017 and outcomes of the year 2018 (n = 21,146). The RF was the best performing algorithm measured by the area under the receiver operating curve (AUC) with a value of 0.883 (95% confidence interval (CI): 0.872–0.893) on test data, followed by the GBM (AUC = 0.878; 95% CI: 0.867–0.889). The ANN (AUC = 0.846; 95% CI: 0.834–0.857) and LR (AUC = 0.839; 95% CI: 0.826–0.852) were significantly outperformed by the GBM and the RF. All ML algorithms and the LR performed ´good´ (i.e. 0.9 > AUC ≥ 0.8). We were able to develop machine learning models that predict high-cost patients with ‘good’ performance facilitating routinely collected sickness fund claims and cost data. We found that tree-based models performed best and outperformed the ANN and LR.


Introduction
The patterns of health service utilization as well as the resulting costs vary substantially across individuals within a given population. The fact that five percent of the population account for about half of the total populations healthcare costs is known to hold for various countries such as the US [1,2], Germany [3,4], Canada [5], Denmark [6], Japan [7], the Netherlands [8] or Australia [9]. The top five percent of patients among the cost distribution are referred to as high-cost patients (HCPs). Compared to non-HCPs, HCPs are more likely to have a low the health insurance. The authors did not have special access privileges to the health insurances.
To request the data please contact (aok@rh.aok. de). To fulfill the legal requirements to obtain the data, data users must conclude a contract with the statutory health insurer regarding data access. The licensee is permitted to use the data for the purpose as set out in the contract. Licensees are not allowed to pass the data to a third party. For assistance in obtaining access to the data, please contact the corresponding author (BL).

Data source and study population
The data used for this study stems from a statutory health fund with insured individuals in the area of Hamburg, Germany. Data access for this research project was granted to the data processing organization (OptiMedis) based on a bilateral contract with the statutory health insurance company. The dataset provided health related data over a period of three years, namely the years 2016 to 2018. The dataset with predictor variables of 2016 and outcomes of 2017 included observations of 20,984 individuals, while the dataset with predictor variables of 2017 and outcomes of 2018 consisted of 21,146 individuals. The dataset included information on demographics, care dependency, disease management program (DMP) participation, out-and inpatient diagnosis as well as data on the prescription of drugs for each insured individual. Further, the costs for each in-or outpatient treatment as well as for prescriptions were available. In the dataset obtained from OptiMedis, no missing values were present.
The diagnosis included in the dataset for in-and outpatient care are based on the ICD-10-GM (German modification). For the analysis in this paper, one-and three-digit level ICD-10-GM codes were included. The restriction to these levels was made as a trade-off between the number of variables, computation time and the precision of the diagnosis included. Consequentially, we included 243 variables reflecting diagnostic categories. We further added two variables indicating the number of different in-and outpatient diagnosis of an individual within a given year.
In addition, Anatomical Therapeutic Chemical (ATC) classification system groups for prescription drugs were included. For ATC codes, we focused on the three digits level for the purpose of building subgroups. This restriction led to 95 subgroups of pharmacological properties, also referred to as "therapeutic subgroup[s]" [29]. Furthermore, 14 "anatomical main group[s]" [29] were built out of the first digits. As in-and outpatient variables, prescription variables were included as dummies. A further variable indicating the absolute number of different ATC codes prescribed in the respective year was included. After elimination of variables that were either constant or yielded insufficient information, we finally utilized 653 predictor variables. In case an individual died within a year of feature determination, this individual was removed from the analysis as the cost within the next year would be zero. If an individual died within a year where the outcome was determined, it was included in the analysis.

Outcome
For outcome definition, we followed previous research [7,[11][12][13][14]16] and defined HCP as to be among the top five percent (i.e. above the 95 percent percentile) of the cost distribution based on all observed costs, namely costs for hospital care, outpatient care and prescription drugs. Together, these costs account for more than two third of the German statutory health insurance systems total spending [30].

Performance indicators
Our main measure of performance was the AUC. Further, we report sensitivity, specificity, accuracy and the G-mean. Accuracy is defined as the number of true positives (TP) and true negatives (TN) divided by the sum of TP, TN, false positives, and false negatives [31]. The G-mean, also referred to as "geometric mean" [17], is the square root of the product of sensitivity and specificity [17]. Additionally, the area under the precision-recall curve was derived (AUC-PR). The AUC-PR acts as an important measure of robust predictive performance for imbalanced data, assessing the trade-off of sensitivity and the true positive rate [32].

Statistical analysis
Machine learning methods. To predict future HCPs we applied three different machine learning algorithms, namely an artificial neural network (ANN) [33], a random forest (RF) [34] and a gradient boosting machine (GBM) [35].
An ANNs consists of multiple layers which consist of multiple units. The first layer is referred to as input layer, the last as output layer. Layers in between are called hidden layers. The ANN passes information through connections between the units (neurons) and layers [36]. When connections form cycles, the ANN is called recurrent ANN, while if connections are acyclic, it is called feed-forward ANN [37]. Each connection has a certain weight which represents the strength of the connection [36]. Weight values are derived by back-propagation [38]. Each neuron passes on an output called "activation" [33] to the next neuron(s), which then transform(s) it via an activation function (which can be of various forms) and forward(s) the information to the next neuron(s) in the system [33]. As a whole, the ANN is able to represent non-linear functions [36]. Certain parameters such as the activation function, the number of units within each layer and the number of layers can be tuned by the researcher [33]. For the purpose of this study, we applied a feed-forward ANN [38]. We varied the number of hidden layers [1, 2 or 3], learning rate [0.003, 0.005, 0.007], activation function [sigmoid (with/ without dropout), maxout (with/without dropout), rectifier (with/without dropout)] and units [10,20 or 100] within each layer. As measure of fit for weights, we applied the cross-entropy loss function [33,38].
The RF is defined as a collection of classifiers with a tree structure, where the trees act as "independent identically distributed random vectors" [34]. Each tree gives a vote about which class the input belongs to (i.e., it classifies). Finally, the whole forest classifies the input as a consequence of the majority vote across it's trees [34]. The depth of the trees can be specified. While increasing the trees depth reduces bias, increasing the number of trees reduces the forests variance, if the trees have low correlation. Low correlation is reached by choosing randomly selected variables for each tree grown, independently of the former tree [33]. For the RF applied in this paper, we varied the number of trees (500 or 1000) and the number of randomly selected features at each split (3, 5, 10 or p 653). The maximum depth for each tree was set to 20. As the RF, the GBM is also tree-based algorithm. The mathematical derivation consists of function estimation, numerical optimization, and the application of gradient boosting algorithm with steepest descent [35]. In contrast to RF, the GBM's trees are not independent of each other. Each tree derived in each step depends on the models fit from the former steps [33]. We varied the number of trees (100, 250, 500 or 1000) and the maximal depth (1 or 2) as tuning parameters in our application.
Previously mentioned hyperparameters were tuned for all models. Hyperparameter optimization for each method was performed for each model using grid search [39,40]. Final hyperparameters selected for each model are shown in S2 Table. As in comparable studies [7,17], a logistic regression was used as baseline comparison model. Statistical difference between the AUC values of different models was assessed, where relevant, using the method of Delong et al 1988 [41].
Training data, test data and cross-validation. To train, validate and test our models, we followed common procedures and split our dataset into training (75%) and test (25%) data. Therefore, we followed Moturo et al. [17] and Osawa et al. [7] and used a training dataset with variables from a year t 0 and HCP status data (= outcome) from a following year t 1 , so that the models predicted outcomes of t 1 with data from t 0 . On this training dataset, 5-fold cross-validation (CV) was performed. In general, k-fold CV is a procedure in which the dataset is split into k parts. Then, k-1 parts are used for training, and the part not used for training is used for validation. This is done k times, until each part of the dataset is used once for validation and k-1 for training. Based on models' performance during CV, we performed model selection [33]. That is, we applied CV to all algorithms (RF, ANN, GBM) with different hyperparameter specifications and chose the best performing hyperparameter specification for each algorithm. Nevertheless, since we selected models based on their performance with CV on training data, we could not rule out that our models overfit with respect to the training data [33]. Therefore, we applied the selected model of each algorithm to an independent unforeseen test dataset for performance assessment. Before being applied, the selected models were trained again on the full training dataset. The test dataset was constructed of variables from t 1 and outcomes of t 2 . Within this study, t 0 , t 1 and t 2 correspond to the years 2016, 2017 and 2018, respectively. Analysis.
To run the desired model in the statistical software R, version 4.0.5, the package H2O (cluster version 3.36.0.4) from the company H2O.ai was used (see [42] for further information).
To identify the most important variables among the 653 predictors, we performed a variable importance analysis for each algorithm. Standardized methods for calculating variable importance are included in the h2o package in R. For the ANN, the h2o package exploits the approach by Gedeon [43,44]. This method sums the magnitude of the weights of the neurons' connections for each input variable and derives its relative importance. For the GBM and the RF, the h2o package calculates the variable importance as the change in squared error induced by the inclusion of the respective variable [45]. Additionally, for the best performing model, we conducted a SHapley Additive exPlanations (SHAP) analysis [46] to explore how the most important variables influenced classification based on the test dataset. SHAP analysis, an approach with origins from game-theory, illustrates how each variables value influenced the respective ML models predicted class probability for each individual observed. It additionally ranks variables with respect to their influence on predicted class probability for the whole sample [47,48]. Due to different approaches of deriving the variables importance, SHAP analysis derived variable importance may differ from the variable importance derived by squared error reduction.
95% asymptotic confidence intervals for the all performance measures except the AUC were calculated by the formula: confidence interval = performance measure ± 1.96 � standard error(performance measure), where the standard error (SE) is: SE = square root[performance measure � (1-performance measure) / n] [49]. For AUC, the confidence intervals were derived using the method of LeDell et al 2015 [50].

Summary statistics
HCPs accounted for about 46.8 percent of the total costs (Table 1). They had a longer average duration of care dependency, a substantially higher age (about 20 years on average), a substantially higher share of individuals older than 65 and spent more time in a DMP, on average. Further, the average costs of a HCPs were about 24,983 € per year, while, for non-HCPs, the average costs are only 1,496 € per year. Besides that, HCPs showed higher values for variables indicating multimorbidity such as the number of different hospital or outpatient diagnostic categories as main diagnosis as well as different ATC prescription subcategories. Note that all differences between the two groups means were statistically significant (column 3 of Table 1).

Model selection
As described, we varied various tuning parameters for each algorithm to perform model selection. The selected models are presented in Table 2. We found the GBM with a maximal depth of two and a number of 500 trees performed best. For the random forest, the model with the number of variables randomly selected at each node set to the square root of the number of all available variables as well as with 500 trees worked best. Finally, the best performing ANN had two hidden layers, 10 units per hidden layer, the maxout activation function with dropout and a learning rate of 0.003 (see also S2 Table).
On validation data, RF and the GBM significantly outperformed LR and the ANN (see non-overlapping confidence intervals in Table 2). The RF reached the highest AUC with a value of 0.882. LR and ANN performed equally good ( Table 2).

Performance evaluation
To evaluate algorithm performance, we applied our selected models to unforeseen test data of the subsequent year. Therefore, models selected based on the validation dataset performance were trained with the full dataset of 2016/2017 as described in the methods section, and then applied to the test dataset of 2017/2018. As on training data, RF remained the best performing model on test data (Table 3), however not statistically significantly different from the GBM (p-value = 0.076 derived by the method of Delong et al 1988 [41]). Further, both GBM and RF outperformed the ANN and LR statistically significantly (p-values: GBM vs. ANN = 0.000; GBM vs. LR = 0.000; RF vs. ANN = 0.000; RF vs. LR = 0.000) (see Fig 1 for AUC plots).
The final tuning parameter values for all models can be found in S2 Table. We also show the confusion matrix of the predictions at the sensitivity-specificity trade-off that maximized the g-mean (S3 Table). Precision-recall curves were plotted and the area under the precisionrecall curve derived in Fig 2. It could be observed that performance for all models decreased, compared to the AUC. Performance decline for RF, GBM and LR was slightly above 0.1. Predictive ability remained reasonable. However, the NN's performance decreased dramatically to only 0.43 on the AUC-PR, implying that the NN was not able to reasonably deal with the imbalanced data when trading-off sensitivity and the true positive rate. We tested the "bal-ance_class" parameter in the h2o.deeplearning() method and found that this did not improve results to a noteworthy extend.

Variable importance
Variable importance was calculated as described in the 'Analysis' section. For simplicity, we only report the top five predictors for each algorithm. Fig 3 shows the five most important variables identified by each algorithm. The most important predictor is set to 1 (see x-axis), while the other four top five predictors are ranked with their importance relative to the most important predictor (see their x-axis values). All algorithms ranked patients age, HCP dummy of the current year and total costs of the current year among the top five most important predictors for predicting next year's HCPs. The ANN also ranked the number of outpatient diagnosis and the number of ATC codes as most important predictors. The GBM ranked one diagnostic code (B.20 -the code for an infectious disease as a consequence of a HIV-disease) [51] and the anatomical main group L for antineoplastic and immunomodulating agents [52] among the top predictor variables. SHAP analysis revealed further insights into how variables influenced predicted class probabilities for the best performing model (RF).
One can observe that the ranking of the variables in the SHAP analysis differs from the one of the previously shown variable importance plots. This is due to the different calculation methods (squared error reduction vs. influence on predicted probabilities, see 'Analysis'). Each dot in Fig 4 represents an individual. A blue dot indicates a small value of the respective variable, while a red dot indicates a high value. The x-axis represents the influence of the variable on the predicted class probability. Thus, we see that having been a HCP, high costs, the intake of medications from the ATC therapeutic subgroup 'L40' (immunosuppressants), the intake of medications of the ATC anatomical main group 'L' (antineoplastic and immunomodulating agents) and high age in the last year increase the likelihood of becoming a HCP within the next year.

Discussion
In line with various previous studies [1-3, 5-7, 11], we found that HCPs, defined as top five percent among the cost distribution, accounted for almost half of healthcare costs in a dataset from a large German sickness fund.
For predictions with ML algorithms, we relied on sickness fund claims data which comes without extra costs for sickness funds to be collected. Other authors with the aim to predict HCPs exploited additional variables such as clinical data [7,11] or behavioral data [11], or relied solely on different data such as text from patient records [12]. Further survey data was often used [11,14,15], which is not as continuously available as sickness fund data, jeopardizing the aim of making continuous up-to-date predictions for the underlying population. While including additional information such as clinical, socioeconomic, or behavioral (survey) data may be beneficial, in practice it is more difficult to be collected on a large scale, up-to-date and at low costs.
To our knowledge, our developed models outperformed all other models from previous studies that aimed to predict HCPs based on AUC (see S4 Table). Our tree-based models reached 'good' performance based on AUC. Only one other study [16] reported a higher AUC while applying a LR. However, this study [16] exploited hospital claims, not sickness fund data, thus relying only on patients already receiving medical treatment associated with inpatient care. This makes their setting and data hard to compare with the setting and data considered in this study. Comparing our results to other studies with a comparable study population, we argue that the high performance of our models is noteworthy since we did not include additional information on top of routinely collected health insurance data, such as others did [7,11]. This finding shows that routinely collected health claims and demographic sickness fund data are sufficient to establish high-performing prediction models. We also found that logistic regression was statistically significantly outperformed by tree-based ML algorithms, in line with previous research [7,17,24,[53][54][55] and to our knowledge a new contribution to the research question at hand. For imbalanced datasets such as ours, balancing the datasets might improve model performance [56]. E.g. Moturo et al. [17] found that balancing classes improves model performance. In contrast, we did not find that balancing the training data did improve predictive performance of our algorithms, even though we tested this approach by using the balance_classes command in the h2o package. This command balances the classes either by under-or oversampling [57]). It is also of relevance that despite model performance issues, the AUC-which was our main outcome metric-is not negatively affected by  imbalanced datasets [58]. In addition to the AUC, we plotted precision-recall curves to assess the trade-off of sensitivity and the true-positive rate among decision thresholds. Tree-based models performed best on this metric, showing reasonable performance albeit the highly skewed data with respect to the small minority class. Further, we found that the model selection procedure with a validation dataset was able to identify models which also performed best on the test data or did not differ statistically significant in their performance from the best model based on test data. Nevertheless, we do not know how our models perform on datasets from other sickness funds (i.e. other populations). Depending on the dataset size and the populations characteristics, other algorithms and model specifications than found in this application may turn out to perform best. Therefore, we recommend repeating the process of training and validation including model selection applications on other datasets. If feasible, a comparison to the models identified through model selection in this paper might be performed to see how the same algorithm specifications perform on different data.
The findings of our study highlight that ML approaches can identify future HCPs with reasonable performance using only routinely available data, thus opening the possibility to reach this population with specific measures. If effective measures that prevent potential HCPs with a high probability from actually becoming HCPs can be implemented, substantial expenditures can be prevented at healthcare system level. This facilitates a shift from healthcare system focusing on sickness towards a healthcare system focusing on prevention and proactive caseand care-management [59].
Most important predictors found in this study were in line with previous evidence. Current healthcare costs [7] and age [7,11,13,14] were also most important predictors in previous research. The fact that a substantial amount of HCP patients remain so for the following year [1,6] may explain why the HCP indicator for the current year has been ranked as highly important predictor for becoming a HCP in the following year. As SHAP analysis revealed, high costs/being a HCP in the current year is associated with an increased probability of being HCP in the next year, as was higher age. Adding to previous evidence, we found that the number of in-or outpatient diagnosis as well as the number of ATC prescriptions may act as top predictors for identifying future HCPs. These variables are easy to create in sickness fund data and therefore provide potential to improve predictive performance with low effort.
Nevertheless, the study comes with several limitations. First, we did only include variables of one year to predict future HCPs. Other authors used data of more than one year, even though they found that adding data of further years did not improve performance notably [7]. Second, certain variables identified to be highly important predictors by algorithms are either unchangeable (age) or indicate patients already have high healthcare utilization (total costs/ current HCP), thus limiting potential for preventive measures. Third, we did only exploit structured, routinely collected healthcare claims data available to sickness funds. There are also other, more unstructured sources of data permanently produced in the healthcare sector, e.g., discharge letters. If medical text data such as discharge letters were included, predictive performance might have further improved. Former research already indicated that models facilitating text data are able to make predictions about HCP status with reasonable performance [12]. Fourth, ML is a fast evolving and dynamic research field. New algorithms are developed on a rapid pace (see e.g. [60,61] for recently developed methods). Further research may therefore apply newly developed ML algorithms to comparable datasets and test whether they are able to outperform well-established ones such as those applied in this paper. Further research is also needed to explore whether adding more unstructured but routinely produced data sources such as text data might yield additional benefits when used on top of claims data. In general, adding other data sources (structured or unstructured) may shed a light on how and which further data sources improve performance on top of sickness fund claims data.
Future research is required to develop effective measures for patients which are predicted to be future HCPs. Additional variables which can be created out of health insurance datasets might be subject to future research. Once implemented in practice, algorithms need continuous training to catch up with medical innovation, which continuously changes costs and healthcare outcomes for various indications.

Conclusion
We were able to apply common ML algorithms to predict future HCPs with 'good' performance, exploiting routinely collected sickness fund data. Some important predictors were in line with previous evidence, while we further found that the number of in-or outpatient diagnosis and drug prescriptions are highly relevant predictors. Apart from 'good' predictive models, preventive measures will be necessary in order to derive benefits from accurate predictions.
Supporting information S1